Skip to main content

Grouping Data

Grouping data is an essential task for data analysis, allowing us to understand and manipulate data at a group level. Pandas provides the groupby() function to facilitate this, implementing the split-apply-combine pattern. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results.

Splitting

To illustrate how groupby() works, let's use a dataset and go through an example step by step.

Importing Libraries and Loading Data

import pandas as pd
import numpy as np

# Load the US census data
df = pd.read_csv('datasets/census.csv')
# Exclude state-level summarizations
df = df[df['SUMLEV'] == 50]
df.head()

Grouping by State

First, let's calculate the average population for each state using a for loop and timing the operation.

%%timeit -n 3

for state in df['STNAME'].unique():
avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
print('Counties in state ' + state + ' have an average population of ' + str(avg))

Now, let's use groupby() to achieve the same result more efficiently.

%%timeit -n 3

for group, frame in df.groupby('STNAME'):
avg = np.average(frame['CENSUS2010POP'])
print('Counties in state ' + group + ' have an average population of ' + str(avg))

Using Functions for Grouping

We can also group data by using custom functions. For example, we can group states into batches based on the first letter of the state name.

df = df.set_index('STNAME')

def set_batch_number(item):
if item[0] < 'M':
return 0
if item[0] < 'Q':
return 1
return 2

for group, frame in df.groupby(set_batch_number):
print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

Grouping with Multi-Index

For more complex grouping, we can use a multi-index. Let's group Airbnb data by cancellation policy and review scores.

df = pd.read_csv("datasets/listings.csv")
df = df.set_index(["cancellation_policy", "review_scores_value"])

for group, frame in df.groupby(level=(0, 1)):
print(group)

We can also use a custom function to group the data.

def grouping_fun(item):
if item[1] == 10.0:
return (item[0], "10.0")
else:
return (item[0], "not 10.0")

for group, frame in df.groupby(by=grouping_fun):
print(group)

Applying Functions to Groups

Aggregation

Aggregation involves computing a summary statistic (e.g., mean, sum) for each group.

df = df.reset_index()
df.groupby("cancellation_policy").agg({"review_scores_value": np.nanmean})
df.groupby("cancellation_policy").agg({
"review_scores_value": (np.nanmean, np.nanstd),
"reviews_per_month": np.nanmean
})

Transformation

Transformation applies a function to each group, returning an object of the same size.

cols = ['cancellation_policy', 'review_scores_value']
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)
df = df.merge(transform_df, left_index=True, right_index=True)
df['mean_diff'] = np.absolute(df['review_scores_value'] - df['mean_review_scores'])
df.head()

Filtering

Filtering removes entire groups based on a condition.

df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)

Applying Custom Functions

The apply() function allows applying an arbitrary function to each group and combining the results.

df = pd.read_csv("datasets/listings.csv")
df = df[['cancellation_policy', 'review_scores_value']]

def calc_mean_review_scores(group):
avg = np.nanmean(group["review_scores_value"])
group["review_scores_mean"] = np.abs(avg - group["review_scores_value"])
return group

df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()